docs: Clarify CLI help text to align with databus hierarchy #47
base: main
Conversation
📝 **Walkthrough**: Updated user-facing help text for the deploy CLI options to clarify artifact-level vs. version-level fields; renamed parameters in the API function `create_dataset`.

Actionable comments posted: 1
🧹 Nitpick comments (4)
databusclient/cli.py (2)
14-56: Consider several improvements to the parsing logic.

- **Warning mechanism (line 49):** Using `print()` for warnings writes directly to stdout, which is not ideal for library code. Consider using Python's `logging` module or collecting warnings in the return value for better testability and control.
- **Compression heuristic is fragile (lines 36-39):** The classification of `.tar` as compression is questionable, since `.tar` is usually treated as an archive format, not a compression format. Files like `.tar.gz` involve both archiving and compression. This heuristic may misclassify extensions and lead to incorrect metadata.
- **Missing docstring examples:** The docstring should include concrete examples showing input and expected output to clarify the expected format and behavior.
- **No validation of key=value pairs (line 43):** The code splits on `'='` but doesn't validate that the key or value is non-empty. Consider adding validation to reject malformed variants like `"=value"` or `"key="`.
- **Inconsistent extension handling:** Line 37's comment suggests conditional removal of the leading dot ("if needed"), but the code always strips it. Clarify the expected format or document why the leading dot is always removed.
🔎 Suggested improvements

Replace `print()` with proper logging:

```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 def parse_distribution_str(dist_str: str):
     """
     Parses a distribution string with format: URL|key=value|...|.extension
     Returns a dictionary suitable for the deploy API.
+
+    Examples:
+        >>> parse_distribution_str("http://example.com/data.json|lang=en|.json|.gz")
+        {'url': 'http://example.com/data.json', 'variants': {'lang': 'en'},
+         'formatExtension': 'json', 'compression': 'gz'}
     """
     ...
     else:
-        print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
+        logger.warning(f"Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
```

Add validation for key=value pairs:

```diff
 elif '=' in part:
     key, value = part.split('=', 1)
+    if not key.strip() or not value.strip():
+        logger.warning(f"Invalid key=value pair '{part}' in distribution.")
+        continue
     variants[key.strip()] = value.strip()
```
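Pulling the review's suggestions together, a minimal self-contained sketch of such a parser could look like this. The function name and dictionary keys follow the review discussion; the set of compression extensions is an assumption for illustration, not the repository's actual code:

```python
import logging

logger = logging.getLogger(__name__)

# Assumed compression extensions; the real project may classify differently.
COMPRESSION_EXTS = {"gz", "bz2", "xz", "zip"}

def parse_distribution_str(dist_str: str) -> dict:
    """Parse 'URL|key=value|...|.ext' into a dict for the deploy API."""
    parts = dist_str.split("|")
    result = {"url": parts[0], "variants": {},
              "formatExtension": None, "compression": "none"}
    for part in parts[1:]:
        if part.startswith("."):
            ext = part.lstrip(".")
            if ext in COMPRESSION_EXTS:
                result["compression"] = ext
            else:
                result["formatExtension"] = ext
        elif "=" in part:
            key, value = part.split("=", 1)
            # Reject malformed pairs like "=value" or "key="
            if not key.strip() or not value.strip():
                logger.warning("Invalid key=value pair %r in distribution.", part)
                continue
            result["variants"][key.strip()] = value.strip()
        elif part:
            logger.warning("Unrecognized modifier %r; expected '.ext' or 'key=val'.", part)
    return result
```

Using `logger.warning` instead of `print()` keeps diagnostics off stdout and lets callers silence or capture them through the standard logging configuration.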
129-137: Remove developer scaffolding comments and consider error handling.

- **Lines 129, 137:** The `# --- CHANGE START/END ---` comments are developer scaffolding and should be removed before merging.
- **Line 131:** Consider adding error handling around the `parse_distribution_str` calls. If parsing fails (e.g., a malformed distribution string), the user should receive a clear error message rather than an uncaught exception.
- **Behavioral change:** This modification changes the contract between the CLI and `api_deploy.create_dataset`. Previously, raw strings were passed; now parsed dictionaries are passed. While this appears to be intentional and supported by changes in `deploy.py`, it's worth verifying that all downstream consumers can handle this change.

🔎 Proposed cleanup

```diff
-# --- CHANGE START ---
-# Parse the input strings into structured objects
 parsed_distributions = [parse_distribution_str(d) for d in distributions]
-# Note: api_deploy.create_dataset now accepts this list of dicts
 dataid = api_deploy.create_dataset(
     version_id, title, abstract, description, license_url, parsed_distributions
 )
-# --- CHANGE END ---
```

databusclient/api/queries.py (1)
54-93: Consider adding validation for malformed key=value pairs.

Similar to `parse_distribution_str` in `cli.py`, this function splits on `'='` (line 87) but doesn't validate that both key and value are non-empty. Consider adding a check to handle edge cases like `"=value"` or `"key="`.

🔎 Suggested validation

```diff
 if "=" in part:
     key, value = part.split("=", 1)
+    if not key.strip() or not value.strip():
+        # Skip malformed key=value pairs
+        continue
     variants[key.strip()] = value.strip()
```

tests/test_parse_distribution.py (1)
1-275: Excellent test coverage with minor gaps. The test suite is comprehensive and well-structured:
- Clear test organization with descriptive test names
- Good coverage of parsing logic: URL extraction, variants, extensions, compression
- Appropriate use of mocking to avoid network dependencies
- Integration tests validate end-to-end behavior with the deploy API
Minor suggestions for additional test coverage:
- **Error handling:** Consider adding tests for truly malformed inputs that might cause exceptions (e.g., `None` input, non-string input, extremely long strings).
- **Edge case, URLs containing the delimiter:** Test behavior when the URL itself contains the `'|'` delimiter character (e.g., `"http://example.com/file?param=a|b|.json"`). The current implementation would split incorrectly.
- **Empty parts:** Test distribution strings with empty parts between delimiters (e.g., `"http://example.com/file||.json"` with a double pipe).

🔎 Example additional tests

```python
def test_url_with_pipe_character(self):
    """Test URL containing a pipe character in the query string."""
    # This is an edge case that might need special handling
    result = parse_distribution_str("http://example.com/file?param=a|b")
    # The current implementation would incorrectly split on the pipe in the URL
    assert result["url"] == "http://example.com/file?param=a|b"

def test_empty_parts_between_delimiters(self):
    """Test handling of empty parts between delimiters."""
    result = parse_distribution_str("http://example.com/file||.json")
    # Should gracefully handle empty parts
    assert result["formatExtension"] == "json"
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- databusclient/api/deploy.py
- databusclient/api/queries.py
- databusclient/cli.py
- tests/test_deploy.py
- tests/test_parse_distribution.py
🧰 Additional context used
🧬 Code graph analysis (2)
databusclient/cli.py (1)
databusclient/api/deploy.py (1)
`create_dataset` (304-465)
tests/test_parse_distribution.py (2)
databusclient/cli.py (1)
`parse_distribution_str` (14-56)

databusclient/api/deploy.py (1)

`_get_file_info_from_dict` (176-208)
🔇 Additional comments (4)
databusclient/cli.py (1)
73-75: LGTM! Help text improvements align with PR objectives. The updated help text successfully clarifies the distinction between Artifact-level metadata (permanent name) and Version-level metadata (release-specific). The terminology now explicitly maps to the DBpedia Databus data model.
databusclient/api/deploy.py (1)
304-387: LGTM! Backwards-compatible API extension. The extension of the `distributions` parameter to accept `Union[List[str], List[Dict]]` is well implemented:

- Backwards compatible with existing string-based distributions
- Clear documentation in the docstring (lines 334-337)
- Appropriate use of an `isinstance()` check to branch on format (line 368)
- Both branches produce the same tuple structure for consistent downstream processing

The approach correctly maintains the existing validation (lines 389-393) for both formats.
tests/test_deploy.py (1)
15-16: LGTM! Good test hygiene. Extracting the example URL into a named constant improves test maintainability and readability.
databusclient/api/queries.py (1)
1-51: This module introduces unused code unrelated to the PR's stated objective.

`ONTOLOGIES_QUERY` and `parse_content_variants_string()` are defined in this new module but are not imported or used anywhere in the codebase. Since the PR aims to clarify CLI help text, this feature addition appears out of scope. Either:

- Move this to a separate PR with proper documentation of the feature's purpose and integration points, or
- Integrate it into the codebase (add imports where needed and update the PR description to reflect this feature work)

Additionally, document why these functions are being added if they're intended for future use.
c6316e2 to aca668e
Greetings @Integer-Ctrl @m1ci, could you please review this PR?
Kindly review it.
Integer-Ctrl left a comment
As you said, it is a good idea to keep the existing CLI arguments for now to avoid breaking changes. However, perhaps you could change the variable names in the method create_dataset in the deploy.py file:
- title -> artifact_title
- abstract -> artifact_abstract
- description -> artifact_description
This would improve clarity.
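The rename Integer-Ctrl proposes would make the artifact scope explicit in the signature while leaving the CLI options unchanged. Roughly, as a sketch with simplified types and a minimal body, not the actual `deploy.py` code:

```python
def create_dataset(version_id: str,
                   artifact_title: str,
                   artifact_abstract: str,
                   artifact_description: str,
                   license_url: str,
                   distributions: list) -> dict:
    """Build a minimal dataset graph; parameter names spell out artifact scope."""
    return {
        "@id": version_id,
        "title": artifact_title,
        "abstract": artifact_abstract,
        "description": artifact_description,
        "license": license_url,
        "distribution": distributions,
    }

# The CLI keeps its existing --title/--abstract/--description options and
# simply maps them onto the renamed parameters at the call site:
dataid = create_dataset(
    version_id="https://databus.example.org/user/group/artifact/2024.01.01",
    artifact_title="Example Artifact",
    artifact_abstract="A short abstract.",
    artifact_description="A longer description.",
    license_url="http://creativecommons.org/licenses/by/4.0/",
    distributions=[],
)
```

The URL and metadata values above are placeholders; only the mapping from CLI option to renamed keyword argument is the point.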
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
databusclient/api/deploy.py (4)
287-296: Update docstring to match the new parameter names.

The docstring still documents the old parameter names (`title`, `abstract`, `description`) instead of the new names (`artifact_title`, `artifact_abstract`, `artifact_description`).

📝 Proposed fix for the docstring

```diff
     version_id: str
         The version ID representing the Dataset. Needs to be in the form of
         $DATABUS_BASE/$ACCOUNT/$GROUP/$ARTIFACT/$VERSION
-    title: str
-        The title text of the dataset
-    abstract: str
-        A short (one or two sentences) description of the dataset
-    description: str
-        A long description of the dataset. Markdown syntax is supported
+    artifact_title: str
+        The artifact title (permanent name of the data series, applies to all versions)
+    artifact_abstract: str
+        The artifact abstract (a short description of the artifact)
+    artifact_description: str
+        The artifact description (detailed metadata for the artifact). Markdown syntax is supported
     license_url: str
         The license of the dataset as a URI.
```
501-507: Update the call to `create_dataset` with the new parameter names.

The `deploy_from_metadata` function still calls `create_dataset` using the old parameter names (`title`, `abstract`, `description`), which will cause a `TypeError` since those parameters no longer exist.

🔧 Proposed fix

```diff
 dataset = create_dataset(
     version_id=version_id,
-    title=title,
-    abstract=abstract,
-    description=description,
+    artifact_title=title,
+    artifact_abstract=abstract,
+    artifact_description=description,
     license_url=license_url,
     distributions=distributions,
 )
```
269-281: Inconsistency: PR description doesn't match the actual changes.

The PR description states: "The changes preserve CLI compatibility and only modify help strings." However, this file changes the public API function signature of `create_dataset` by renaming parameters, which is a breaking change for any code that calls this function with keyword arguments.

If the intent is truly to only update CLI help text without breaking the API, consider keeping the original parameter names in the function signature and only updating the CLI layer where argument parsing happens.
269-281: Critical: Incomplete parameter rename causes NameError and TypeError at runtime.

The function signature renames `title`, `abstract`, and `description` to `artifact_title`, `artifact_abstract`, and `artifact_description`, but the refactoring is incomplete:

- **Function body references undefined variables (lines 383-387, 393-405):** The code still uses the old parameter names, which will raise `NameError` at runtime.
- **All callers use the old parameter names:** `deploy_from_metadata()` (line 501), `databusclient/cli.py:84`, and `tests/test_deploy.py:94` will fail with `TypeError` (unexpected keyword arguments).
- **Docstring is outdated (lines 282-315):** It still documents the old parameter names, creating an API documentation mismatch.

Additionally, this is a breaking API change that contradicts the PR description's claim of "preserve CLI compatibility."

Required fixes

Update lines 383-387 and 393-405 to use the renamed parameters:

```diff
 artifact_graph = {
     "@id": artifact_id,
     "@type": "Artifact",
-    "title": title,
-    "abstract": abstract,
-    "description": description,
+    "title": artifact_title,
+    "abstract": artifact_abstract,
+    "description": artifact_description,
 }
```

```diff
 dataset_graph = {
     "@type": ["Version", "Dataset"],
     "@id": _versionId,
     "hasVersion": version,
-    "title": title,
-    "abstract": abstract,
-    "description": description,
+    "title": artifact_title,
+    "abstract": artifact_abstract,
+    "description": artifact_description,
     "license": license_url,
     "distribution": distribution_list,
 }
```

Update all callers in `deploy_from_metadata()`, `databusclient/cli.py`, and `tests/test_deploy.py` to use the new parameter names, and update the docstring (lines 282-315) to reflect them.
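If the rename must not break existing keyword-argument callers, one option is a temporary deprecation shim that accepts both spellings. This is a hypothetical sketch, not part of this PR; the minimal return dict stands in for the real graph-building logic:

```python
import warnings

_RENAMES = {
    "title": "artifact_title",
    "abstract": "artifact_abstract",
    "description": "artifact_description",
}

def create_dataset(version_id, artifact_title=None, artifact_abstract=None,
                   artifact_description=None, license_url=None,
                   distributions=None, **kwargs):
    """Accept the old keyword names during a deprecation period."""
    params = {
        "artifact_title": artifact_title,
        "artifact_abstract": artifact_abstract,
        "artifact_description": artifact_description,
    }
    for old, new in _RENAMES.items():
        if old in kwargs:
            warnings.warn(f"'{old}' is deprecated; use '{new}' instead",
                          DeprecationWarning, stacklevel=2)
            if params[new] is None:
                params[new] = kwargs.pop(old)
    if kwargs:
        raise TypeError(f"Unexpected keyword arguments: {sorted(kwargs)}")
    return {
        "@id": version_id,
        "title": params["artifact_title"],
        "abstract": params["artifact_abstract"],
        "description": params["artifact_description"],
        "license": license_url,
        "distribution": distributions or [],
    }
```

Old callers keep working (with a warning) while new code uses the clearer names, and the shim can be dropped after a release cycle.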
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- databusclient/api/deploy.py
- databusclient/cli.py
🚧 Files skipped from review as they are similar to previous changes (1)
- databusclient/cli.py
Description

This PR updates the CLI help text for `--title`, `--abstract`, and `--description` to align with the DBpedia Databus data model hierarchy.

Currently, the CLI uses generic "Dataset" terminology, which causes confusion between Artifacts (the logical container) and Versions (the specific release). Users often mistake `--title` for a version-specific name, or `--abstract` for a global description.

These changes clarify the scope of each argument without breaking existing CLI compatibility.
Changes
I have updated the help strings to explicitly map the arguments to their databus concepts:
| Argument | Old help text | New help text |
| --- | --- | --- |
| `--title` | Dataset title | Artifact Label: the permanent name of the data series (applies to all versions) |
| `--abstract` | Dataset abstract max 200 chars | Version Abstract: a short summary (max 200 chars) specific to this timestamped release |
| `--description` | Dataset description | Version Description: detailed metadata for this specific release (supports Markdown) |

Solves Issue #21
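For illustration, the help-text mapping could be wired up as below. This is sketched with stdlib `argparse` rather than the project's actual CLI framework, so the option-handling details are assumptions:

```python
import argparse

parser = argparse.ArgumentParser(prog="databusclient deploy")
parser.add_argument(
    "--title",
    help="Artifact Label: the permanent name of the data series "
         "(applies to all versions)")
parser.add_argument(
    "--abstract",
    help="Version Abstract: a short summary (max 200 chars) specific "
         "to this timestamped release")
parser.add_argument(
    "--description",
    help="Version Description: detailed metadata for this specific "
         "release (supports Markdown)")

# The --help output now spells out the Artifact vs. Version scope of each option.
help_text = parser.format_help()
```

Only the help strings change; option names and parsing behavior stay exactly as before, which is what keeps the change backwards compatible.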
Summary by CodeRabbit
Documentation